Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCP/CORE: Complete outstanding RNDV reqs after all UCT lanes are destroyed #227

Merged

Conversation

dmitrygx
Copy link

@dmitrygx dmitrygx commented Dec 1, 2021

Fixes:

Error iodemo analyzer: [1638101311.245289] [DEMO] ERROR: iov data corruption (0x7f15fecfb800: expected: segment=9700 conn_id=4 seed=c13f6b8 got: segment=9700 conn_id=ffff seed=ffffffff) at 309248 position detected on [UCX-connection 0x1cc9cb0: #6 1.1.27.4:59736] (status=Success) for operation (length=316907 op="write")

found by NVDA/MLNX MTT (internal link)

@dmitrygx dmitrygx force-pushed the topic/ucp/comp_rndv_reqs_after_destroy branch from 6881b25 to 2f82417 Compare December 7, 2021 21:36
@@ -202,6 +202,15 @@ size_t ucp_tag_rndv_rts_pack(void *dest, void *arg)
return sizeof(*rndv_rts_hdr) + packed_rkey_size;
}

static void ucp_rndv_req_cancel(ucp_request_t *sreq, ucs_status_t status)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

smith like ucp_rndv_req_add_to_cancel_list

@@ -234,7 +243,7 @@ UCS_PROFILE_FUNC(ucs_status_t, ucp_proto_progress_rndv_rts, (self),
return UCS_ERR_NO_RESOURCE;
} else {
ucs_assert(UCS_STATUS_IS_ERR(status));
ucp_rndv_complete_send(sreq, status, "rts_cancel");
ucp_rndv_req_cancel(sreq, status);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert

@dmitrygx dmitrygx force-pushed the topic/ucp/comp_rndv_reqs_after_destroy branch 2 times, most recently from ee93157 to c6ecd5c Compare December 8, 2021 21:51
@dmitrygx
Copy link
Author

dmitrygx commented Dec 8, 2021

@evgeny-leksikov @brminich could you review pls?
The PR has been stabilized and I've not seen "iov data corruption" for at least 9 iterations (the MTT/io-demo reproducer is still running more iterations, will check it then).

@dmitrygx dmitrygx changed the title UCP/CORE: Complete outstanding RNDV reqs after all UCT lanes are destroyed [WIP-DNM] UCP/CORE: Complete outstanding RNDV reqs after all UCT lanes are destroyed Dec 8, 2021
@dmitrygx
Copy link
Author

dmitrygx commented Dec 9, 2021

@yosefe could you review pls?
no errors were seen during the previous nightly testing. it did 32 iterations.

@@ -490,7 +490,7 @@ void ucp_request_handle_send_error(ucp_request_t *req, ucs_status_t status)
}
} else {
if (req->flags & UCP_REQUEST_FLAG_SEND_RNDV) {
ucp_rndv_complete_send(req, UCS_ERR_CANCELED, "rndv_flush");
ucp_rndv_req_add_to_cancel_list(req, status);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if i could not send an RTS, why not release it immediately?
the receiver should not be aware of the rndv operation

@@ -202,6 +202,19 @@ size_t ucp_tag_rndv_rts_pack(void *dest, void *arg)
return sizeof(*rndv_rts_hdr) + packed_rkey_size;
}

void ucp_rndv_req_add_to_cancel_list(ucp_request_t *sreq, ucs_status_t status)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think better rename to
ucp_rndv_req_add_to_canceled_list

ucp_request_send(sreq, 0);
}
if (sreq->flags & UCP_REQUEST_FLAG_RNDV_RTS_SENT) {
ucp_rndv_req_add_to_cancel_list(sreq, UCS_ERR_CANCELED);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like we don't really send cancel now?

@dmitrygx dmitrygx force-pushed the topic/ucp/comp_rndv_reqs_after_destroy branch from 5a153cb to 6a4677f Compare December 9, 2021 19:12
@dmitrygx
Copy link
Author

the MTT/io-demo passed successfully 20 iterations during the night with the latest changes.
@yosefe could you review pls?

@dmitrygx
Copy link
Author

passed 20 more iterations

@dmitrygx dmitrygx force-pushed the topic/ucp/comp_rndv_reqs_after_destroy branch from 6a4677f to 9de9fbc Compare December 11, 2021 19:19
@dmitrygx
Copy link
Author

@yosefe thanks for the review! squashed
ok to merge?

@yosefe yosefe merged commit e4b9d9e into yosefe:integration3 Dec 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants